Credit Card Fraud Detection with Python¶

Business Problem:¶

Credit card fraud is the unauthorized use of a credit or debit card to make purchases. Everyone along the payment lifecycle pays the price for fraudulent activity, from the consumer who makes purchases in person or online to the merchant who finalizes those purchases, and the costs can be staggering: global financial losses related to payment cards were estimated to reach $34.66 billion in 2022.

Credit card companies have an obligation to protect their customers' finances, so they employ fraud detection models to flag unusual financial activity and freeze a user's card when transaction activity is out of the ordinary for that individual. The cost of mislabeling a fraudulent transaction as legitimate is the user's money being stolen, which the credit card company typically reimburses. The cost of mislabeling a legitimate transaction as fraud is the user being frozen out of their finances and unable to make payments. There is a delicate tradeoff between these two error costs, and we will discuss how to handle it when training a model.


Project Tasks:¶

  • Perform an exploratory data analysis to understand which features might be correlated with fraud.
  • Build models with those features and evaluate their predictive effectiveness.
  • Tune hyperparameters.
  • Handle the imbalanced dataset to improve model performance.

The Dataset:¶

The dataset is from Kaggle. It is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering 1 Jan 2019 through 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants. (Thanks to Brandon Harris for his amazing work in creating this easy-to-use simulation tool for generating fraud transaction datasets.) There are 23 columns and 1,296,675 rows.

  • "Unnamed:0" id of the record
  • trans_date_trans_time
  • cc_num:
  • merchant: merchat name
  • category: transaction category
  • amt: transaction amount
  • first: first name
  • last: last name
  • gender: "F" , "M"
  • street: street address
  • city
  • state
  • zip: zip code
  • lat: latitudinal
  • long: longitudinal
  • city_pop: city population
  • job: career
  • trans_num: transaction number
  • unix_time: unix format time stamp
  • merch_lat: latitudinal of merchant
  • merch_long: longitudinal of merchant
  • is_fraud: fraudulent transactions as 1 and non-fraudulent as 0
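As a quick check of the schema above, the timestamp column can be parsed at load time rather than converted afterwards. A minimal sketch (the two-row in-memory CSV is a hypothetical stand-in for fraud.csv):

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of fraud.csv.
csv = io.StringIO(
    "trans_date_trans_time,amt,is_fraud\n"
    "2019-01-01 00:00:18,4.97,0\n"
    "2019-01-01 00:00:44,107.23,1\n"
)

# parse_dates converts the timestamp column to datetime64 at load time,
# so no separate pd.to_datetime pass is needed later.
df = pd.read_csv(csv, parse_dates=["trans_date_trans_time"])
print(df["trans_date_trans_time"].dt.hour.tolist())  # [0, 0]
```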


In [1]:
pip install plotly
Requirement already satisfied: plotly in /opt/anaconda3/lib/python3.9/site-packages (5.6.0)
Requirement already satisfied: six in /opt/anaconda3/lib/python3.9/site-packages (from plotly) (1.16.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/anaconda3/lib/python3.9/site-packages (from plotly) (8.0.1)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression # For Logistic Regression model
from sklearn.tree import DecisionTreeClassifier # For Decision Tree classification model
from sklearn.ensemble import RandomForestClassifier # For Random Forest classification model
from sklearn.model_selection import GridSearchCV # For hyperparameter tuning
from sklearn.preprocessing import LabelEncoder # For converting categorical variables to numerical
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,roc_auc_score
import plotly.express as px

Data Validation¶

In [3]:
fraud = pd.read_csv('fraud.csv')
fraud.head()
Out[3]:
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street ... lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 0 2019-01-01 00:00:18 2703186189652095 fraud_Rippin, Kub and Mann misc_net 4.97 Jennifer Banks F 561 Perry Cove ... 36.0788 -81.1781 3495 Psychologist, counselling 1988-03-09 0b242abb623afc578575680df30655b9 1325376018 36.011293 -82.048315 0
1 1 2019-01-01 00:00:44 630423337322 fraud_Heller, Gutmann and Zieme grocery_pos 107.23 Stephanie Gill F 43039 Riley Greens Suite 393 ... 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99 1325376044 49.159047 -118.186462 0
2 2 2019-01-01 00:00:51 38859492057661 fraud_Lind-Buckridge entertainment 220.11 Edward Sanchez M 594 White Dale Suite 530 ... 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95 1325376051 43.150704 -112.154481 0
3 3 2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell gas_transport 45.00 Jeremy White M 9443 Cynthia Court Apt. 038 ... 46.2306 -112.1138 1939 Patent attorney 1967-01-12 6b849c168bdad6f867558c3793159a81 1325376076 47.034331 -112.561071 0
4 4 2019-01-01 00:03:06 375534208663984 fraud_Keeling-Crist misc_pos 41.96 Tyler Garcia M 408 Bradley Rest ... 38.4207 -79.4629 99 Dance movement psychotherapist 1986-03-28 a41d7549acf90789359a9aa5346dcb46 1325376186 38.674999 -78.632459 0

5 rows × 23 columns

In [4]:
fraud.info()
# no missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long                   1296675 non-null  float64
 15  city_pop               1296675 non-null  int64  
 16  job                    1296675 non-null  object 
 17  dob                    1296675 non-null  object 
 18  trans_num              1296675 non-null  object 
 19  unix_time              1296675 non-null  int64  
 20  merch_lat              1296675 non-null  float64
 21  merch_long             1296675 non-null  float64
 22  is_fraud               1296675 non-null  int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 227.5+ MB
In [5]:
fraud.describe()
Out[5]:
Unnamed: 0 cc_num amt zip lat long city_pop unix_time merch_lat merch_long is_fraud
count 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06 1.296675e+06
mean 6.483370e+05 4.171920e+17 7.035104e+01 4.880067e+04 3.853762e+01 -9.022634e+01 8.882444e+04 1.349244e+09 3.853734e+01 -9.022646e+01 5.788652e-03
std 3.743180e+05 1.308806e+18 1.603160e+02 2.689322e+04 5.075808e+00 1.375908e+01 3.019564e+05 1.284128e+07 5.109788e+00 1.377109e+01 7.586269e-02
min 0.000000e+00 6.041621e+10 1.000000e+00 1.257000e+03 2.002710e+01 -1.656723e+02 2.300000e+01 1.325376e+09 1.902779e+01 -1.666712e+02 0.000000e+00
25% 3.241685e+05 1.800429e+14 9.650000e+00 2.623700e+04 3.462050e+01 -9.679800e+01 7.430000e+02 1.338751e+09 3.473357e+01 -9.689728e+01 0.000000e+00
50% 6.483370e+05 3.521417e+15 4.752000e+01 4.817400e+04 3.935430e+01 -8.747690e+01 2.456000e+03 1.349250e+09 3.936568e+01 -8.743839e+01 0.000000e+00
75% 9.725055e+05 4.642255e+15 8.314000e+01 7.204200e+04 4.194040e+01 -8.015800e+01 2.032800e+04 1.359385e+09 4.195716e+01 -8.023680e+01 0.000000e+00
max 1.296674e+06 4.992346e+18 2.894890e+04 9.978300e+04 6.669330e+01 -6.795030e+01 2.906700e+06 1.371817e+09 6.751027e+01 -6.695090e+01 1.000000e+00
In [6]:
fraud.shape
Out[6]:
(1296675, 23)

Exploratory Data Analysis & Feature Engineering¶

Target Variable: is_fraud¶

In [7]:
# Target Variable - is_fraud
ax = sns.countplot(x = 'is_fraud', data = fraud)
for p in ax.patches:
   ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.33, p.get_height()+160))
fraud.is_fraud.mean()
# The fraud rate is 0.58%. It is an imbalanced dataset.
Out[7]:
0.005788651743883394
In [4]:
labels = ["Genuine", "Fraud"]

values = fraud["is_fraud"].value_counts().tolist()  # [genuine_count, fraud_count]

fig = px.pie(values=values, names=labels, width=700, height=400,
             color_discrete_sequence=["skyblue", "purple"],
             title="Fraud vs Genuine Transactions")
fig.show()
In [9]:
#Plotting the heat map to find the correlation between the numeric columns
plt.figure(figsize=(15,10))
sns.heatmap(fraud.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="Blues")  # numeric_only=True needed on newer pandas
plt.show()

Categorical Variables: merchant, category, gender, city, state, job, trans_week_day, trans_hour¶

In [10]:
# merchant
fraud[fraud.is_fraud == 1].merchant.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by Merchant (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].merchant.value_counts()
# Insight: merchant could be a predictor; fraud_Rau and Sons, fraud_Cormier LLC, and fraud_Kozey-Boehm have the most fraud activity.
Out[10]:
fraud_Rau and Sons        49
fraud_Cormier LLC         48
fraud_Kozey-Boehm         48
fraud_Doyle Ltd           47
fraud_Vandervort-Funk     47
                          ..
fraud_Kuphal-Toy           1
fraud_Eichmann-Kilback     1
fraud_Lynch-Mohr           1
fraud_Tillman LLC          1
fraud_Hills-Olson          1
Name: merchant, Length: 679, dtype: int64
In [11]:
# category:
sns.countplot(x= 'category',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Transaction Category")
plt.xticks(rotation=80)
plt.show()

# Insight: category could be a good predictor; shopping_net and grocery_pos seem to have relatively higher fraud activity.
In [12]:
# Gender
sns.countplot(x= 'gender',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Gender")
plt.show()
# Insight: hard to tell the difference from gender
In [13]:
#Relation between Gender and Fraud
ax=sns.histplot(x='gender',data=fraud, hue='is_fraud',stat='percent',multiple='dodge',common_norm=False)
ax.set_ylabel('Percentage')
ax.set_xlabel('Credit Card Holder Gender')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.title("Relation between Gender and Fraud")
plt.show()
In [14]:
# State
plt.figure(figsize=(20,8))
sns.countplot(x= 'state',data = fraud[fraud.is_fraud == 1])
plt.xticks(rotation=90)
plt.title("Number of Credit Card Fraud by State")
plt.show()
# Insight: state could be a good predictor; NY, TX, and PA report the most fraud.
In [6]:
#plot a map of the United States where color intensity shows the number of fraud transactions per state
import plotly.express as px
df = fraud.groupby('state')['is_fraud'].sum().to_frame()
df.reset_index(inplace=True)
df = df.rename(columns={'state': 'State', 'is_fraud': 'Fraud Transactions'})
fig = px.choropleth(df,
                    locations='State',
                    color='Fraud Transactions',
                    locationmode='USA-states',
                    color_continuous_scale="Pinkyl",
                    scope='usa')
fig.add_scattergeo(
    locations=df['State'],
    locationmode='USA-states',
    text=df['State'],
    mode='text'
)
fig.show()
In [16]:
# City:
fraud[fraud.is_fraud == 1].city.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by City (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].city.value_counts()
# Insight: city could be a good predictor; Houston, Warren, and Huntsville report the most fraud.
Out[16]:
Houston           39
Warren            33
Huntsville        29
Naples            29
Dallas            27
                  ..
Florence           3
Kilgore            2
Phoenix            2
Phenix City        2
Denham Springs     2
Name: city, Length: 702, dtype: int64
In [17]:
# Job
fraud[fraud.is_fraud == 1].job.value_counts(sort =True, ascending = False).head(10).plot(kind = 'bar')
plt.title("Number of Credit Card Fraud by Job (Top 10)")
plt.show()
fraud[fraud.is_fraud == 1].job.value_counts()
# Insight: job could be a good predictor; materials engineer, trading standards officer, and naval architect report the most fraud.
Out[17]:
Materials engineer                      62
Trading standards officer               56
Naval architect                         53
Exhibition designer                     51
Surveyor, land/geomatics                50
                                        ..
Statistician                             3
Health physicist                         3
Chartered loss adjuster                  3
English as a second language teacher     2
Contractor                               2
Name: job, Length: 443, dtype: int64
In [18]:
# Generate some new categorical variables:
# convert trans_date_trans_time from string to datetime format
fraud['trans_date'] = pd.to_datetime(fraud['trans_date_trans_time'], format = "%Y-%m-%d %H:%M:%S")
In [19]:
# extract the transaction month of the year
fraud['trans_month'] = fraud['trans_date'].dt.month
sns.countplot(x= 'trans_month',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Month")
plt.show()
# Insight: trans_month could be a predictor.
In [20]:
# extract the transaction day of week
fraud['trans_week_day'] = fraud['trans_date'].dt.day_name()
sns.countplot(x= 'trans_week_day',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by Day of Week")
plt.show()

#Insight: trans_week_day could be a good predictor; Saturday, Sunday, and Monday report the most fraud.
In [21]:
# extract the transaction hour of day
fraud['trans_hour']=fraud['trans_date'].dt.hour
plt.figure(figsize=(20,8))
sns.countplot(x= 'trans_hour',data = fraud[fraud.is_fraud == 1])
plt.title("Number of Credit Card Fraud by hour")
plt.show()
#Insight: trans_hour could be a really good predictor; hours 22, 23, 0, 1, 2, and 3 report the most fraud.
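One caveat for the count plots above: hours that simply see more traffic will show more fraud even if they are no riskier. Normalizing to a fraud rate per hour separates the two effects; a minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame standing in for the full dataset (hypothetical values).
toy = pd.DataFrame({
    "trans_hour": [22, 22, 22, 10, 10, 10, 10],
    "is_fraud":   [1,  1,  0,  0,  0,  0,  1],
})

# Mean of the 0/1 target per hour is the fraud rate per hour; raw counts
# would conflate busy hours with genuinely risky hours.
rate_by_hour = toy.groupby("trans_hour")["is_fraud"].mean()
print(rate_by_hour.to_dict())
```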

Numerical Variables: amt, age (from dob), distance (from lat/long and merch_lat/merch_long)¶

In [22]:
#amount vs fraud
ax=sns.histplot(x='amt',data=fraud[fraud.amt<=1000],hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Transaction Amount in USD')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: amount could be a good predictor
In [23]:
#age
fraud['age']=fraud['trans_date'].dt.year-pd.to_datetime(fraud['dob']).dt.year
ax=sns.kdeplot(x='age',data=fraud, hue='is_fraud', common_norm=False)
ax.set_xlabel('Credit Card Holder Age')
ax.set_ylabel('Density')
plt.xticks(np.arange(0,110,5))
plt.title('Age Distribution in Fraudulent vs Non-Fraudulent Transactions')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
In [24]:
ax=sns.histplot(x='age',data=fraud,hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Credit Card Holder Age')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: Age could be a predictor.
In [25]:
pip install h3
Collecting h3
  Downloading h3-3.7.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 62.9 MB/s eta 0:00:00
Installing collected packages: h3
Successfully installed h3-3.7.6
Note: you may need to restart the kernel to use updated packages.
In [26]:
# calculate the distance between cardholder and merchant from lat/long and merch_lat/merch_long
# h3.point_dist returns the great-circle distance in km (spherical law of cosines: acos(sin(lat1)*sin(lat2)+cos(lat1)*cos(lat2)*cos(lon2-lon1))*6371, with angles in radians)
import h3
fraud['distance']= fraud.apply(lambda row: h3.point_dist((row['lat'],row['long']),(row['merch_lat'],row['merch_long'])),axis=1)
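If the h3 dependency is undesirable, the same great-circle distance can be computed directly with NumPy. A sketch using the haversine formula (Earth radius 6371 km), which also vectorizes over whole columns instead of the slower row-wise .apply:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (the same quantity h3.point_dist returns by default)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Vectorized over whole columns, e.g.:
# fraud['distance'] = haversine_km(fraud['lat'], fraud['long'],
#                                  fraud['merch_lat'], fraud['merch_long'])
```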
In [27]:
ax=sns.histplot(x='distance',data=fraud,hue='is_fraud',stat='percent',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel('Percentage in Each Type')
ax.set_xlabel('Distance')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])
plt.show()
# Insight: distance could be a predictor.
In [28]:
fig, axes = plt.subplots(1,3,figsize=(20,5))
sns.boxplot(x =fraud.is_fraud,y=fraud[fraud.amt<=1000].amt, ax=axes[0]).set(title='Fraud vs Transaction Amount')
sns.boxplot(x =fraud.is_fraud,y=fraud.age,  ax=axes[1]).set(title='Fraud vs Age')
sns.boxplot(x =fraud.is_fraud,y=fraud.distance, ax=axes[2]).set(title='Fraud vs Distance')
plt.show()

# Insight: amt could be a pretty good predictor; age could be a predictor; distance is hard to tell from the boxplot as the distributions mostly overlap.

Preprocessing Data¶

In [29]:
fraud.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 29 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   Unnamed: 0             1296675 non-null  int64         
 1   trans_date_trans_time  1296675 non-null  object        
 2   cc_num                 1296675 non-null  int64         
 3   merchant               1296675 non-null  object        
 4   category               1296675 non-null  object        
 5   amt                    1296675 non-null  float64       
 6   first                  1296675 non-null  object        
 7   last                   1296675 non-null  object        
 8   gender                 1296675 non-null  object        
 9   street                 1296675 non-null  object        
 10  city                   1296675 non-null  object        
 11  state                  1296675 non-null  object        
 12  zip                    1296675 non-null  int64         
 13  lat                    1296675 non-null  float64       
 14  long                   1296675 non-null  float64       
 15  city_pop               1296675 non-null  int64         
 16  job                    1296675 non-null  object        
 17  dob                    1296675 non-null  object        
 18  trans_num              1296675 non-null  object        
 19  unix_time              1296675 non-null  int64         
 20  merch_lat              1296675 non-null  float64       
 21  merch_long             1296675 non-null  float64       
 22  is_fraud               1296675 non-null  int64         
 23  trans_date             1296675 non-null  datetime64[ns]
 24  trans_month            1296675 non-null  int64         
 25  trans_week_day         1296675 non-null  object        
 26  trans_hour             1296675 non-null  int64         
 27  age                    1296675 non-null  int64         
 28  distance               1296675 non-null  float64       
dtypes: datetime64[ns](1), float64(6), int64(9), object(13)
memory usage: 286.9+ MB
In [30]:
#prepare data for modeling

# convert categorical variables to numerical format
labelencoder = LabelEncoder()
fraud['merchant']=labelencoder.fit_transform(fraud['merchant'])
fraud['category']=labelencoder.fit_transform(fraud['category'])
fraud['gender']=labelencoder.fit_transform(fraud['gender'])
fraud['city']=labelencoder.fit_transform(fraud['city'])
fraud['state']=labelencoder.fit_transform(fraud['state'])
fraud['job']=labelencoder.fit_transform(fraud['job'])
fraud['trans_week_day']=labelencoder.fit_transform(fraud['trans_week_day'])
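The repeated fit_transform calls can be collapsed into a loop that also keeps each fitted encoder, which is needed later to transform new data consistently or decode predictions. A sketch on a hypothetical toy frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the real categorical columns.
toy = pd.DataFrame({"category": ["grocery_pos", "misc_net", "grocery_pos"],
                    "gender": ["F", "M", "F"]})

# One encoder per column, kept in a dict for later reuse.
encoders = {}
for col in ["category", "gender"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Keeping the fitted encoders allows decoding later:
print(encoders["gender"].inverse_transform([0, 1]))  # ['F' 'M']
```

Note that LabelEncoder is documented for encoding targets; an ordinal encoding of features is generally acceptable for tree-based models, but a linear model would read unintended order into it.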
In [31]:
feature_cols = ['merchant','category', 'gender','city', 'state', 'job','trans_month','trans_week_day','trans_hour','age','distance','amt']
X = fraud[feature_cols].copy() # Features (explicit copy so the scaling below does not trigger SettingWithCopyWarning)
y = fraud['is_fraud'] # Target variable

# define the scaler 
scaler = MinMaxScaler()
# fit and transform the numeric features
X[['age', 'distance','amt']] = scaler.fit_transform(X[['age', 'distance','amt']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
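Two refinements worth considering in this split, sketched below on hypothetical toy data: passing stratify=y keeps the fraud rate identical in both splits, and fitting the scaler on the training split only avoids leaking the test set's min/max into training:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data: 1000 rows, 1% positives (a stand-in for the real features/target).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"amt": rng.uniform(1, 500, 1000)})
y_demo = pd.Series([1] * 10 + [0] * 990)

# stratify=y preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

scaler = MinMaxScaler()
X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[["amt"]] = scaler.fit_transform(X_tr[["amt"]])  # fit on train only
X_te[["amt"]] = scaler.transform(X_te[["amt"]])      # reuse the train min/max

print(y_tr.mean(), y_te.mean())  # both 0.01: class ratio preserved
```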

Model Fitting, Hyperparameter Tuning & Evaluation¶

Decision Tree¶

In [32]:
# Create Decision Tree classifier object
dtm = DecisionTreeClassifier(criterion="entropy", max_depth=8)

# Train the Decision Tree classifier
dtm = dtm.fit(X_train,y_train)

Hyperparameter Tuning for the Decision Tree Model¶

In [33]:
# Hyperparameter tuning for the Decision Tree model
train_score = []
test_score = []
max_score = 0
max_pair = (0,0)

for i in range(1,50):
    tree = DecisionTreeClassifier(max_depth=i,random_state=42)
    tree.fit(X_train,y_train)
    y_pred = tree.predict_proba(X_train)[:,1]
    y_pred_t = tree.predict_proba(X_test)[:,1]
    train_score.append(metrics.roc_auc_score(y_train,y_pred))
    test_score.append(metrics.roc_auc_score(y_test, y_pred_t))
    test_pair = (i,metrics.roc_auc_score(y_test,y_pred_t))
    if test_pair[1] > max_pair[1]:
        max_pair = test_pair

fig, ax = plt.subplots()

ax.plot(np.arange(1,50), train_score, label = "train roc_auc_score", color='purple')
ax.plot(np.arange(1,50), test_score, label = "test roc_auc_score", color='lime')
ax.legend()
print(f'Best max_depth is: {max_pair[0]} \nroc_auc_score is: {max_pair[1]}')
Best max_depth is: 8 
roc_auc_score is: 0.9843941728418165
In [34]:
#define metrics
y_pred_proba = dtm.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()

# Calculate the G-mean
gmean = np.sqrt(tpr * (1 - fpr)) # using G-mean

# Find the optimal threshold
index = np.argmax(gmean)
thresholdOpt = round(thresholds[index], ndigits = 4)
gmeanOpt = round(gmean[index], ndigits = 4)
fprOpt = round(fpr[index], ndigits = 4)
tprOpt = round(tpr[index], ndigits = 4)
print('Best Threshold: {} with G-Mean: {}'.format(thresholdOpt, gmeanOpt))
print('FPR: {}, TPR: {}'.format(fprOpt, tprOpt))
Best Threshold: 0.0064 with G-Mean: 0.9625
FPR: 0.0286, TPR: 0.9536
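The scan above optimizes the G-mean over the ROC curve; when the target metric is F1 instead (as in a later cell), the analogous sweep can use precision_recall_curve. A sketch on hypothetical toy scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels/scores standing in for y_test and predict_proba(...)[:, 1].
y_true  = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.45, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# F1 at every candidate threshold (the final precision/recall pair has no threshold).
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(best)  # 0.4
```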
In [35]:
#Predict the response for the test dataset
# select a low threshold so the recall of the fraud ("1") class is high
# (note: 0.0286 is the FPR at the G-mean optimum above; the optimal threshold itself was 0.0064)
threshold = 0.0286
y_pred = (dtm.predict_proba(X_test)[:, 1] > threshold).astype('float')

dtm_matrix = metrics.confusion_matrix(y_test, y_pred)
print(dtm_matrix)
dtm_report = metrics.classification_report(y_test,y_pred)
print(dtm_report)
[[382257   4461]
 [   169   2116]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99    386718
           1       0.32      0.93      0.48      2285

    accuracy                           0.99    389003
   macro avg       0.66      0.96      0.74    389003
weighted avg       1.00      0.99      0.99    389003

In [36]:
#Predict the response for the test dataset
# raise the threshold to trade some recall for precision and maximize the F1-score
threshold = 0.25
y_pred = (dtm.predict_proba(X_test)[:, 1] > threshold).astype('float')

np.set_printoptions(precision=1) 
dtm_matrix = metrics.confusion_matrix(y_test, y_pred)
print(dtm_matrix)
dtm_report = metrics.classification_report(y_test,y_pred,digits=4)
print(dtm_report)
[[386511    207]
 [   556   1729]]
              precision    recall  f1-score   support

           0     0.9986    0.9995    0.9990    386718
           1     0.8931    0.7567    0.8192      2285

    accuracy                         0.9980    389003
   macro avg     0.9458    0.8781    0.9091    389003
weighted avg     0.9979    0.9980    0.9980    389003

In [37]:
resultdict = {}
for i in range(len(feature_cols)):
    resultdict[feature_cols[i]] = dtm.feature_importances_[i]
    
plt.bar(resultdict.keys(),resultdict.values())
plt.xticks(rotation='vertical')
plt.title('Feature Importance in Decision Tree Model')

# amt, category,trans_hour, age,gender
Out[37]:
Text(0.5, 1.0, 'Feature Importance in Decision Tree Model')

Random Forest¶

In [38]:
rf = RandomForestClassifier(random_state = 42, n_estimators=1000, bootstrap = True, max_depth=10,criterion='entropy')
rf.fit(X_train, y_train)
Out[38]:
RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=1000,
                       random_state=42)
In [39]:
#define metrics

y_pred_proba = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()

# Calculate the G-mean
gmean = np.sqrt(tpr * (1 - fpr)) # using G-mean

# Find the optimal threshold
index = np.argmax(gmean)
thresholdOpt = round(thresholds[index], ndigits = 4)
gmeanOpt = round(gmean[index], ndigits = 4)
fprOpt = round(fpr[index], ndigits = 4)
tprOpt = round(tpr[index], ndigits = 4)
print('Best Threshold: {} with G-Mean: {}'.format(thresholdOpt, gmeanOpt))
print('FPR: {}, TPR: {}'.format(fprOpt, tprOpt))
Best Threshold: 0.0118 with G-Mean: 0.9692
FPR: 0.0305, TPR: 0.9689

Hyperparameter Tuning for the Random Forest Model¶

In [40]:
# Create the parameter grid for grid search

param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 20],
    'n_estimators': [200, 500, 1000]
}
# Create a base model
rf_t = RandomForestClassifier(random_state = 42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf_t, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
In [ ]:
# Fit the grid search to the data (commented out: can take a long time, >100 min)
# grid_search.fit(X_train, y_train)
# grid_search.best_params_
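If the full grid is too slow, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying all of them; a minimal sketch on hypothetical toy data (sizes kept small only for speed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

param_dist = {"max_depth": [5, 10, 20], "n_estimators": [10, 20, 50]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=4,            # evaluate 4 random combinations, not all 9
    cv=3, scoring="roc_auc", random_state=42, n_jobs=-1)
search.fit(X_demo, y_demo)
print(search.best_params_)
```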
In [42]:
y_pred2 = rf.predict(X_test)
rf_matrix = metrics.confusion_matrix(y_test, y_pred2)
print(rf_matrix)

rf_report = metrics.classification_report(y_test,y_pred2)
print(rf_report)
[[386694     24]
 [   893   1392]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    386718
           1       0.98      0.61      0.75      2285

    accuracy                           1.00    389003
   macro avg       0.99      0.80      0.88    389003
weighted avg       1.00      1.00      1.00    389003

In [43]:
#Predict the response for the test dataset
# lower the threshold from the default 0.5 to improve the F1-score
threshold = 0.25
y_pred2 = (rf.predict_proba(X_test)[:, 1] > threshold).astype('float')

rf_matrix = metrics.confusion_matrix(y_test, y_pred2)
print(rf_matrix)
rf_report = metrics.classification_report(y_test,y_pred2,digits=4)
print(rf_report)
[[386502    216]
 [   492   1793]]
              precision    recall  f1-score   support

           0     0.9987    0.9994    0.9991    386718
           1     0.8925    0.7847    0.8351      2285

    accuracy                         0.9982    389003
   macro avg     0.9456    0.8921    0.9171    389003
weighted avg     0.9981    0.9982    0.9981    389003

In [44]:
resultdict = {}
for i in range(len(feature_cols)):
    resultdict[feature_cols[i]] = rf.feature_importances_[i]
    
plt.bar(resultdict.keys(),resultdict.values())
plt.xticks(rotation='vertical')
plt.title('Feature Importance in Random Forest Model')

# amt, trans_hour, category, age....
Out[44]:
Text(0.5, 1.0, 'Feature Importance in Random Forest Model')

Model Comparison¶

In [45]:
#set up plotting area
plt.figure(0).clf()
#fit decisiom tree model and plot ROC curve
y_pred = dtm.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Decision Tree, AUC="+str(auc))

#fit random forest model and plot ROC curve

y_pred = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Random Forest, AUC="+str(auc))
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.title(" AUC Comparison")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
#add legend
plt.legend()
Out[45]:
<matplotlib.legend.Legend at 0x7fe58eff48e0>

Model Comparison

| Metric/Model | Decision Tree | Random Forest |
|---|---|---|
| Accuracy | 0.9980 | 0.9987 |
| Precision | 0.8926 | 0.8925 |
| Recall | 0.7567 | 0.7847 |
| F1-Score | 0.8190 | 0.8351 |
| AUC | 0.9842 | 0.9937 |

Based on these metrics, the random forest model performs better, so we choose the Random Forest model.

Dealing with Imbalanced Dataset¶

Imbalanced data refers to a situation, primarily in classification machine learning, where one target class represents only a small proportion of observations. Imbalanced datasets are those with a severe skew in the class distribution, such as 1:100 or 1:1000 minority-to-majority examples. There are several approaches to addressing class imbalance before classification, such as:

  • Random oversampling of the minority class
  • Random undersampling of the majority class
  • SMOTE (Synthetic Minority Oversampling Technique)
  • Undersampling using Tomek links
  • Combining SMOTE and Tomek links
  • Class weights in the models

This section shows how to deal with the imbalanced dataset. We will use only 10% of the whole dataset to speed up modeling and hyperparameter tuning.
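The first option, random oversampling, needs nothing beyond pandas: resample the minority class with replacement, on the training split only so duplicated rows cannot straddle train and test. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy training frame: 8 genuine rows, 2 fraud rows (hypothetical values).
train = pd.DataFrame({"amt": range(10),
                      "is_fraud": [0] * 8 + [1] * 2})

minority = train[train.is_fraud == 1]
majority = train[train.is_fraud == 0]

# Sample the minority class with replacement up to the majority size.
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])
print(upsampled.is_fraud.value_counts().to_dict())  # {0: 8, 1: 8}
```

Alternatively, most scikit-learn classifiers accept class_weight='balanced', which reweights the loss without modifying the data at all.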

In [63]:
data = fraud.sample(frac=0.1, random_state = 42)
data.head()
Out[63]:
Unnamed: 0 trans_date_trans_time cc_num merchant category amt first last gender street city state zip lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud trans_date trans_month trans_week_day trans_hour age distance
1045211 1045211 2020-03-09 15:09:26 577588686219 629 9 194.51 James Strickland 1 25454 Leonard Lake 768 38 15686 40.6153 -79.4545 972 378 1997-10-23 fff87d4340ef756a592eac652493cf6b 1362841766 40.420453 -78.865012 0 2020-03-09 15:09:26 3 1 15 23 54.336180
547406 547406 2019-08-22 15:49:01 30376238035123 180 5 52.32 Cynthia Davis 0 7177 Steven Forges 750 37 97476 42.8250 -124.4409 217 400 1928-10-01 d0ad335af432f35578eea01d639b3621 1345650541 42.758860 -123.636337 0 2019-08-22 15:49:01 8 4 15 91 66.060940
110142 110142 2019-03-04 01:34:16 4658490815480264 429 12 6.53 Tara Richards 0 4879 Cristina Station 400 38 15449 39.9636 -79.7853 184 444 1945-11-04 87f26e3ea33f4ff4c7a8bad2c7f48686 1330824856 40.475159 -78.898190 0 2019-03-04 01:34:16 3 1 1 74 94.386151
1285953 1285953 2020-06-16 20:04:38 3514897282719543 187 6 7.33 Steven Faulkner 1 841 Cheryl Centers Suite 115 262 34 14425 42.9580 -77.3083 10717 115 1952-10-13 9c34015321c0fa2ae6fd20f9359d1d3e 1371413078 43.767506 -76.542384 0 2020-06-16 20:04:38 6 5 20 68 109.251413
271705 271705 2019-05-14 05:54:48 6011381817520024 92 2 64.29 Kristen Allen 0 8619 Lisa Manors Apt. 871 419 50 82221 41.6423 -104.1974 635 358 1973-07-13 198437c05676f485e9be04449c664475 1336974888 41.040392 -104.092324 0 2019-05-14 05:54:48 5 5 5 46 67.501592
In [61]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129668 entries, 1045211 to 879092
Data columns (total 29 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   Unnamed: 0             129668 non-null  int64         
 1   trans_date_trans_time  129668 non-null  object        
 2   cc_num                 129668 non-null  int64         
 3   merchant               129668 non-null  int64         
 4   category               129668 non-null  int64         
 5   amt                    129668 non-null  float64       
 6   first                  129668 non-null  object        
 7   last                   129668 non-null  object        
 8   gender                 129668 non-null  int64         
 9   street                 129668 non-null  object        
 10  city                   129668 non-null  int64         
 11  state                  129668 non-null  int64         
 12  zip                    129668 non-null  int64         
 13  lat                    129668 non-null  float64       
 14  long                   129668 non-null  float64       
 15  city_pop               129668 non-null  int64         
 16  job                    129668 non-null  int64         
 17  dob                    129668 non-null  object        
 18  trans_num              129668 non-null  object        
 19  unix_time              129668 non-null  int64         
 20  merch_lat              129668 non-null  float64       
 21  merch_long             129668 non-null  float64       
 22  is_fraud               129668 non-null  int64         
 23  trans_date             129668 non-null  datetime64[ns]
 24  trans_month            129668 non-null  int64         
 25  trans_week_day         129668 non-null  int64         
 26  trans_hour             129668 non-null  int64         
 27  age                    129668 non-null  int64         
 28  distance               129668 non-null  float64       
dtypes: datetime64[ns](1), float64(6), int64(16), object(6)
memory usage: 29.7+ MB
In [64]:
data.is_fraud.mean()

## fraud: 0.6%
Out[64]:
0.005961378289169263
In [65]:
feature_cols = ['merchant','category', 'gender','city', 'state', 'job','trans_month','trans_week_day','trans_hour','age','distance','amt']
X = data[feature_cols].copy() # Features (copy to avoid SettingWithCopyWarning below)
y = data['is_fraud'] # Target variable

# define the scaler 
scaler = MinMaxScaler()
# fit and transform the train set
X[['age', 'distance','amt']] = scaler.fit_transform(X[['age', 'distance','amt']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
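Note that the cell above fits the scaler on the full dataset before splitting, which leaks test-set statistics (min/max) into training. A minimal sketch of the leakage-free alternative, using random toy data in place of the real `age`/`distance`/`amt` columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the three numeric columns scaled above
rng = np.random.default_rng(42)
X_num = rng.uniform(0, 100, size=(1000, 3))
y_toy = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_num, y_toy, test_size=0.3, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit on the training set only
X_te_scaled = scaler.transform(X_te)      # reuse the train min/max on the test set

print(X_tr_scaled.min(), X_tr_scaled.max())  # train columns land in [0, 1] by construction
```

With only three well-behaved numeric columns the difference here is small, but the fit-on-train pattern is the safe default.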

Base Model: Random Forest¶

In [133]:
# Create the parameter grid based on the results of random search 

param_grid = {
    'bootstrap': [True],
    'max_depth': [4, 6, 8, 10],
    'n_estimators': [50, 100, 200]
}
# Create a base model
rf = RandomForestClassifier(random_state = 42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
In [134]:
# Fit the grid search to the data.  
grid_search.fit(X_train, y_train)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[134]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 200}
In [135]:
rf = RandomForestClassifier(random_state = 42, n_estimators=200, bootstrap = True, max_depth=10,criterion='entropy')
rf.fit(X_train, y_train)
Out[135]:
RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=200,
                       random_state=42)
In [136]:
# focus more on "recall"
y_pred_3 = rf.predict(X_test)

rf_matrix = metrics.confusion_matrix(y_test, y_pred_3)
print(rf_matrix)
rf_report = metrics.classification_report(y_test, y_pred_3, digits=4)
print(rf_report)
[[38520   139]
 [   47   195]]
              precision    recall  f1-score   support

           0     0.9988    0.9964    0.9976     38659
           1     0.5838    0.8058    0.6771       242

    accuracy                         0.9952     38901
   macro avg     0.7913    0.9011    0.8373     38901
weighted avg     0.9962    0.9952    0.9956     38901

Random Resampling Imbalanced Datasets¶

Random Oversampling Imbalanced Dataset¶

In [101]:
from imblearn.pipeline import Pipeline, make_pipeline
In [87]:
#  Random Oversampling Imbalanced Datasets
from imblearn.over_sampling import RandomOverSampler
# define oversampling strategy
ros = RandomOverSampler(random_state=42)
In [88]:
# fit and apply the transform
X_over, y_over = ros.fit_resample(X_train, y_train)
In [92]:
print('Genuine:', y_over.value_counts()[0], '/', round(y_over.value_counts()[0]/len(y_over) * 100,2), '% of the dataset')
print('Frauds:', y_over.value_counts()[1], '/',round(y_over.value_counts()[1]/len(y_over) * 100,2), '% of the dataset')
Genuine: 90236 / 50.0 % of the dataset
Frauds: 90236 / 50.0 % of the dataset
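`RandomOverSampler` simply duplicates minority rows until the classes match. A minimal numpy sketch of the same idea on a hypothetical tiny dataset (10 genuine, 2 fraud), to show what the transform does under the hood:

```python
import numpy as np

# Hypothetical toy data: 10 genuine (class 0), 2 fraud (class 1)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(12, 3))
y_toy = np.array([0] * 10 + [1] * 2)

# Random oversampling by hand: draw minority rows with replacement
minority = np.flatnonzero(y_toy == 1)
extra = rng.choice(minority, size=(y_toy == 0).sum() - minority.size, replace=True)

X_over_toy = np.vstack([X_toy, X_toy[extra]])
y_over_toy = np.concatenate([y_toy, np.ones(extra.size, dtype=int)])

print(np.bincount(y_over_toy))  # [10 10] — both classes now have 10 samples
```

Because the added rows are exact duplicates, a tree model can memorize them, which is why the oversampled scores below barely move relative to the base model.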
In [137]:
# Fit the grid search to the data.  
grid_search.fit(X_over, y_over)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[137]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
In [138]:
rf_over = RandomForestClassifier(random_state = 42, n_estimators=100, bootstrap = True, max_depth=10,criterion='entropy')
rf_over.fit(X_over, y_over)
Out[138]:
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
In [139]:
# focus on "recall"
y_pred_over = rf_over.predict(X_test)

rf_over_matrix = metrics.confusion_matrix(y_test, y_pred_over)
print(rf_over_matrix)
rf_over_report = metrics.classification_report(y_test,y_pred_over,digits=4)
print(rf_over_report)
[[38509   150]
 [   48   194]]
              precision    recall  f1-score   support

           0     0.9988    0.9961    0.9974     38659
           1     0.5640    0.8017    0.6621       242

    accuracy                         0.9949     38901
   macro avg     0.7814    0.8989    0.8298     38901
weighted avg     0.9961    0.9949    0.9953     38901

Random Undersampling Imbalanced Dataset¶

In [140]:
from imblearn.under_sampling import RandomUnderSampler
# define undersampling strategy
rus = RandomUnderSampler(random_state=42)

# fit and apply the transform
X_under, y_under = rus.fit_resample(X_train, y_train)

print('Genuine:', y_under.value_counts()[0], '/', round(y_under.value_counts()[0]/len(y_under) * 100,2), '% of the dataset')
print('Frauds:', y_under.value_counts()[1], '/',round(y_under.value_counts()[1]/len(y_under) * 100,2), '% of the dataset')
Genuine: 531 / 50.0 % of the dataset
Frauds: 531 / 50.0 % of the dataset
In [141]:
# Fit the grid search to the data.  
grid_search.fit(X_under, y_under)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[141]:
{'bootstrap': True, 'max_depth': 8, 'n_estimators': 200}
In [142]:
rf_under = RandomForestClassifier(random_state = 42, n_estimators=200, bootstrap = True, max_depth=8,criterion='entropy')
rf_under.fit(X_under, y_under)
Out[142]:
RandomForestClassifier(criterion='entropy', max_depth=8, n_estimators=200,
                       random_state=42)
In [143]:
# focus on "recall"
y_pred_under = rf_under.predict(X_test)

rf_under_matrix = metrics.confusion_matrix(y_test, y_pred_under)
print(rf_under_matrix)
rf_under_report = metrics.classification_report(y_test,y_pred_under,digits=4)
print(rf_under_report)
[[36258  2401]
 [   15   227]]
              precision    recall  f1-score   support

           0     0.9996    0.9379    0.9678     38659
           1     0.0864    0.9380    0.1582       242

    accuracy                         0.9379     38901
   macro avg     0.5430    0.9380    0.5630     38901
weighted avg     0.9939    0.9379    0.9627     38901

In [163]:
# Since the recall is high, we could change the threshold to balance precision and recall.
threshold = 0.7
y_pred_under2 = (rf_under.predict_proba(X_test)[:, 1] > threshold).astype('float')

rf_under_matrix2 = metrics.confusion_matrix(y_test, y_pred_under2)
print(rf_under_matrix2)
rf_under_report2 = metrics.classification_report(y_test,y_pred_under2,digits=4)
print(rf_under_report2)
[[38134   525]
 [   41   201]]
              precision    recall  f1-score   support

           0     0.9989    0.9864    0.9926     38659
           1     0.2769    0.8306    0.4153       242

    accuracy                         0.9855     38901
   macro avg     0.6379    0.9085    0.7040     38901
weighted avg     0.9944    0.9855    0.9890     38901
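The 0.7 threshold above was picked by hand. `precision_recall_curve` gives one precision/recall pair per candidate threshold, so the trade-off can be swept systematically. A sketch on a hypothetical imbalanced dataset standing in for the fraud data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical 97/3 imbalanced data in place of the real fraud set
X_d, y_d = make_classification(n_samples=4000, weights=[0.97], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=42, stratify=y_d)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# precision/recall at every threshold the scores induce
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# e.g. the lowest threshold that still achieves precision >= 0.5
ok = precision[:-1] >= 0.5  # precision has one extra trailing entry
best = thresholds[ok][0] if ok.any() else None
print(best)
```

The same sweep applied to `rf_under.predict_proba(X_test)` would let us justify (or improve on) the 0.7 chosen above.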

SMOTE (Synthetic Minority Oversampling Technique)¶

In [151]:
from imblearn.over_sampling import SMOTE
In [153]:
sm = SMOTE(sampling_strategy = 'minority', random_state=42)
X_smote, y_smote = sm.fit_resample(X_train, y_train)
In [154]:
print('Genuine:', y_smote.value_counts()[0], '/', round(y_smote.value_counts()[0]/len(y_smote) * 100,2), '% of the dataset')
print('Frauds:', y_smote.value_counts()[1], '/',round(y_smote.value_counts()[1]/len(y_smote) * 100,2), '% of the dataset')
Genuine: 90236 / 50.0 % of the dataset
Frauds: 90236 / 50.0 % of the dataset
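Unlike random oversampling, SMOTE does not duplicate rows: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbours. The core of the algorithm is one line of arithmetic, sketched here with made-up points:

```python
import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])   # a minority sample (hypothetical values)
x_nn = np.array([3.0, 4.0])  # one of its nearest minority neighbours

lam = rng.uniform()                 # gap drawn uniformly from [0, 1)
x_new = x_i + lam * (x_nn - x_i)    # synthetic point on the connecting segment

print(x_new)  # lies somewhere between x_i and x_nn
```

Because synthetic frauds are spread between real ones, the decision boundary widens, which explains the higher false-positive count (lower precision) in the report below compared with plain oversampling.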
In [155]:
# Fit the grid search to the data.  
grid_search.fit(X_smote, y_smote)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[155]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
In [156]:
rf_smote = RandomForestClassifier(random_state = 42, n_estimators=100, bootstrap = True, max_depth=10,criterion='entropy')
rf_smote.fit(X_smote, y_smote)
Out[156]:
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
In [157]:
y_pred_smote = rf_smote.predict(X_test)

rf_smote_matrix = metrics.confusion_matrix(y_test, y_pred_smote)
print(rf_smote_matrix)
rf_smote_report = metrics.classification_report(y_test,y_pred_smote,digits=4)
print(rf_smote_report)
[[37893   766]
 [   59   183]]
              precision    recall  f1-score   support

           0     0.9984    0.9802    0.9892     38659
           1     0.1928    0.7562    0.3073       242

    accuracy                         0.9788     38901
   macro avg     0.5956    0.8682    0.6483     38901
weighted avg     0.9934    0.9788    0.9850     38901

Undersampling using Tomek Links¶

In [158]:
from imblearn.under_sampling import TomekLinks

# define the undersampling method
#tomekU = TomekLinks(sampling_strategy='auto', n_jobs=-1)
tomekU = TomekLinks()

# fit and apply the transform
X_underT, y_underT = tomekU.fit_resample(X_train, y_train)
print('Genuine:', y_underT.value_counts()[0], '/', round(y_underT.value_counts()[0]/len(y_underT) * 100,2), '% of the dataset')
print('Frauds:', y_underT.value_counts()[1], '/',round(y_underT.value_counts()[1]/len(y_underT) * 100,2), '% of the dataset')
Genuine: 89980 / 99.41 % of the dataset
Frauds: 531 / 0.59 % of the dataset
In [159]:
# Fit the grid search to the data.  
grid_search.fit(X_underT, y_underT)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[159]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
In [160]:
rf_underT = RandomForestClassifier(random_state = 42, n_estimators=100, bootstrap = True, max_depth=10,criterion='entropy')
rf_underT.fit(X_underT, y_underT)
Out[160]:
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
In [161]:
y_pred_underT = rf_underT.predict(X_test)

rf_underT_matrix = metrics.confusion_matrix(y_test, y_pred_underT)
print(rf_underT_matrix)
rf_underT_report = metrics.classification_report(y_test,y_pred_underT,digits=4)
print(rf_underT_report)
[[38654     5]
 [  109   133]]
              precision    recall  f1-score   support

           0     0.9972    0.9999    0.9985     38659
           1     0.9638    0.5496    0.7000       242

    accuracy                         0.9971     38901
   macro avg     0.9805    0.7747    0.8493     38901
weighted avg     0.9970    0.9971    0.9967     38901

In [166]:
# Since the precision is high, we could change the threshold to balance precision and recall.
threshold = 0.1
y_pred_underT2 = (rf_underT.predict_proba(X_test)[:, 1] > threshold).astype('float')

rf_underT_matrix2 = metrics.confusion_matrix(y_test, y_pred_underT2)
print(rf_underT_matrix2)
rf_underT_report2 = metrics.classification_report(y_test,y_pred_underT2,digits=4)
print(rf_underT_report2)
[[38460   199]
 [   41   201]]
              precision    recall  f1-score   support

           0     0.9989    0.9949    0.9969     38659
           1     0.5025    0.8306    0.6262       242

    accuracy                         0.9938     38901
   macro avg     0.7507    0.9127    0.8115     38901
weighted avg     0.9958    0.9938    0.9946     38901

Combining SMOTE and Tomek Links¶

In [167]:
from imblearn.combine import SMOTETomek
In [168]:
st = SMOTETomek()

# fit and apply the transform
X_st, y_st = st.fit_resample(X_train, y_train)
print('Genuine:', y_st.value_counts()[0], '/', round(y_st.value_counts()[0]/len(y_st) * 100,2), '% of the dataset')
print('Frauds:', y_st.value_counts()[1], '/',round(y_st.value_counts()[1]/len(y_st) * 100,2), '% of the dataset')
Genuine: 90233 / 50.0 % of the dataset
Frauds: 90233 / 50.0 % of the dataset
In [169]:
# Fit the grid search to the data.  
grid_search.fit(X_st, y_st)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[169]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 100}
In [172]:
rf_st = RandomForestClassifier(random_state = 42, n_estimators=100, bootstrap = True, max_depth=10,criterion='entropy')
rf_st.fit(X_st, y_st)
Out[172]:
RandomForestClassifier(criterion='entropy', max_depth=10, random_state=42)
In [180]:
y_pred_st = rf_st.predict(X_test)

rf_st_matrix = metrics.confusion_matrix(y_test, y_pred_st)
print(rf_st_matrix)
rf_st_report = metrics.classification_report(y_test,y_pred_st,digits=4)
print(rf_st_report)
[[37837   822]
 [   56   186]]
              precision    recall  f1-score   support

           0     0.9985    0.9787    0.9885     38659
           1     0.1845    0.7686    0.2976       242

    accuracy                         0.9774     38901
   macro avg     0.5915    0.8737    0.6431     38901
weighted avg     0.9935    0.9774    0.9842     38901

Class Weights in the models¶

Most machine learning models provide a class_weight parameter. In a random forest classifier, for example, we can assign a higher weight to the minority class by passing class_weight a dictionary.

Without weights, the model treats every point as equally important. Weights scale the loss function: as the model trains, each point's error is multiplied by the weight of its class, so the estimator works hardest to minimize error on the more heavily weighted classes, which send a stronger signal.
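The "balanced" setting used below computes each class weight as n_samples / (n_classes * class_count), so the rarer the class, the larger its weight. A small sketch with made-up toy labels:

```python
import numpy as np

# Toy labels: 95 genuine (0) vs 5 fraud (1), mimicking class imbalance
y = np.array([0] * 95 + [1] * 5)

# "balanced" weights: n_samples / (n_classes * count_per_class)
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(weights)  # each fraud sample counts roughly 19x more than a genuine one
```

The same effect can be achieved manually by passing an explicit dictionary, e.g. `RandomForestClassifier(class_weight={0: 1, 1: 19})`.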

In [175]:
# If you choose class_weight = "balanced", 
# the classes will be weighted inversely proportional to how frequently they appear in the data.

rfb = RandomForestClassifier(random_state=42, class_weight="balanced")
In [176]:
grid_search = GridSearchCV(estimator = rfb, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2, scoring = 'roc_auc')
In [177]:
# Fit the grid search to the data.  
grid_search.fit(X_train, y_train)
grid_search.best_params_
# could take a long time
Fitting 3 folds for each of 12 candidates, totalling 36 fits
Out[177]:
{'bootstrap': True, 'max_depth': 10, 'n_estimators': 200}
In [183]:
rfb = RandomForestClassifier(random_state = 42, class_weight="balanced", n_estimators=200, bootstrap = True, max_depth=10,criterion='entropy')
rfb.fit(X_train, y_train)
Out[183]:
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=10, n_estimators=200, random_state=42)
In [184]:
y_pred_rfb = rfb.predict(X_test)

rfb_matrix = metrics.confusion_matrix(y_test, y_pred_rfb)
print(rfb_matrix)
rfb_report = metrics.classification_report(y_test,y_pred_rfb,digits=4)
print(rfb_report)
[[38588    71]
 [   59   183]]
              precision    recall  f1-score   support

           0     0.9985    0.9982    0.9983     38659
           1     0.7205    0.7562    0.7379       242

    accuracy                         0.9967     38901
   macro avg     0.8595    0.8772    0.8681     38901
weighted avg     0.9967    0.9967    0.9967     38901
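As a sanity check, the fraud-class precision, recall, and F1 in the report above can be recomputed directly from the confusion matrix, since all three are simple ratios of its entries:

```python
# Confusion matrix reported above for the class-weighted model:
# [[38588    71]
#  [   59   183]]
tn, fp, fn, tp = 38588, 71, 59, 183

precision = tp / (tp + fp)  # of all predicted frauds, how many were frauds
recall = tp / (tp + fn)     # of all actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.7205 0.7562 0.7379, matching the classification report
```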

Performance Comparison¶

In [191]:
#set up plotting area
plt.figure(0).clf()
# ROC curve for the base random forest model (no resampling)
y_pred = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Base, AUC="+str(auc))

# ROC curve for the oversampling model

y_pred = rf_over.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Oversampling, AUC="+str(auc))


y_pred = rf_under.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Undersampling, AUC="+str(auc))

y_pred = rf_smote.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="SMOTE, AUC="+str(auc))

y_pred = rf_underT.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Tomek_Links, AUC="+str(auc))

y_pred = rf_st.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="SMOTE_Tomek, AUC="+str(auc))

y_pred = rfb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
auc = round(metrics.roc_auc_score(y_test, y_pred), 4)
plt.plot(fpr,tpr,label="Class_Weights, AUC="+str(auc))

plt.plot([0, 1], [0, 1], color='purple', linestyle='--')
plt.title(" AUC Comparison")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
#add legend
plt.legend()
Out[191]:
<matplotlib.legend.Legend at 0x7fe581887910>
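The roc_auc_score values compared above have a useful probabilistic reading: AUC is the probability that a randomly chosen fraud receives a higher model score than a randomly chosen genuine transaction (ties counting half). A minimal pure-Python sketch of that definition, with made-up scores:

```python
def roc_auc(y_true, scores):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This is why AUC is threshold-independent: it only depends on how well the scores rank frauds above genuine transactions, not on any particular cutoff.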

Model Comparison

Metric/Method   Base     Oversampling  Undersampling  SMOTE    Tomek_Links  SMOTE_Tomek  Class_Weights
Accuracy        0.9952   0.9949        0.9379         0.9788   0.9971       0.9774       0.9967
Precision       0.5838   0.5640        0.0864         0.1928   0.9638       0.1845       0.7205
Recall          0.8058   0.8017        0.9380         0.7562   0.5496       0.7686       0.7562
F1-Score        0.6771   0.6621        0.1582         0.3073   0.7000       0.2976       0.7379
AUC             0.9873   0.9883        0.9818         0.9467   0.9497       0.9474       0.9888

Based on these metrics, the Class Weights method has the best overall performance, so we choose the Random Forest model with class weights.

Conclusion¶

  • We have done the exploratory data analysis and found that amt, category, transaction time, age, state, and city could be good predictors to classify fraud transactions.
  • We have generated some new features, such as trans_hour, trans_day_of_week, and distance.
  • We have created the Decision Tree model as a base model and the Random Forest model as a comparison model. The Random Forest model performs better.
  • From the Random Forest model, we found that the top 3 important features are amt, trans_hour, and category, which is consistent with the finding from EDA.
  • We tried to deal with the imbalanced dataset and improve model performance with 6 different methods: Oversampling, Undersampling, SMOTE, Tomek Links, SMOTE Tomek, and Class Weights. The Class Weights method performs best.
  • We performed hyperparameter tuning for all models and imbalance-handling methods to ensure our model is robust.
  • There is much more to do when it comes to K-Fold Cross Validation and tuning hyperparameters.
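On the K-Fold point: with only ~0.6% frauds, plain K-Fold can leave a fold with almost no fraud cases, so stratified splitting is the natural next step. The sketch below illustrates the idea behind sklearn's StratifiedKFold (a simplified illustration, not the library's implementation), splitting each class evenly across folds:

```python
import numpy as np

def stratified_kfold_indices(y, k, seed=42):
    """Yield (train_idx, test_idx) pairs with each class spread evenly
    across the k folds, so every fold keeps the original class ratio."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        # deal this class's indices out across the k folds
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk)
    for i in range(k):
        test = np.array(folds[i])
        train = np.hstack([folds[j] for j in range(k) if j != i])
        yield train, test

# Toy imbalanced labels: 90 genuine, 10 fraud
y = np.array([0] * 90 + [1] * 10)
for train, test in stratified_kfold_indices(y, k=5):
    print(len(test), int(y[test].sum()))  # every test fold keeps 2 frauds
```

In practice `sklearn.model_selection.StratifiedKFold` can be passed directly to GridSearchCV via its `cv` parameter.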

References:¶

  1. https://www.ncr.com/blogs/payments/credit-card-fraud-detection
  2. https://www.kaggle.com/datasets/kartik2112/fraud-detection
  3. https://www.kaggle.com/code/marcinrutecki/best-techniques-and-metrics-for-imbalanced-dataset